Hello all,
I have a string column listing the co-occurring mutations for each observation and need to split it into multiple columns, one mutation per column. I managed to split off the first mutation and the remainder (second through last), but I am stuck on going further, because that requires picking out the first instance of a ")".
The string is too long for dataex to export, but a sample of the column 'Cooccurmutn' is shown below. The pattern I find useful is:
- gene names always start with an upper-case letter, and
- each gene name immediately follows either a lower-case letter or a ")" that ends the previous mutation.
CUX1 Loss - EquivocalWT1 Loss - EquivocalAPC Loss EquivocalBRCA1 Loss - EquivocalEZH2 Loss - EquivocalNF1 Loss
SF3B1 c.2098A>G, p.K700E (NM_012433.4) (VAF: 12%)WT1 c.1156_1159dup, p.A387Vfs*4 (NM_024426.6) (VAF: 5%)ATR c.2634-1G>A, p.? (NM_001184.4) (VAF: 15%)STK11 Loss - Equivocal
gen com = ustrregexs(0) if ustrregexm(Cooccurmutn, "(^[A-Z].*[a-z]$)") // first comutation
replace com = ustrregexs(0) if ustrregexm(Cooccurmutn, "^[A-Z].*([a-z]|\))$") & missing(com) // first comutation
replace com = ustrregexs(1) if ustrregexm(Cooccurmutn, "(^[A-Z]*.*[a-z]{4})([A-Z]*.*)") & missing(com) // first comutation
replace com = ustrregexs(1) if ustrregexm(Cooccurmutn, "(^[A-Z]*.*([0-9]|%)\))([A-Z]*.*)") & missing(com) // first comutation
gen com2 = ustrregexra(Cooccurmutn, "(^[A-Z]*.*([0-9]|%)\))", "") // second through end
What I want for the first row/obs would be:
comut1   | comut2 | comut3
DNMT3A.* | IDH2.* | JAK2.*
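To make the pattern concrete, here is a rough, untested sketch of the idea: mark every point where a lower-case letter or ")" is immediately followed by an upper-case letter, then split on that marker. The names work and comut are ones I made up for this sketch, and it assumes "|" never occurs in the data. I have not checked how split behaves if Cooccurmutn is stored as a strL.

```stata
* Untested sketch, not a verified solution.
* Insert "|" at each zero-width position that has a lower-case letter or ")"
* just before it and an upper-case letter just after it.
* Assumption: "|" does not already appear anywhere in Cooccurmutn.
generate work = ustrregexra(Cooccurmutn, "(?<=[a-z\)])(?=[A-Z])", "|")

* Split on the marker into comut1, comut2, ... (comut is a made-up stub name).
split work, parse("|") generate(comut)
drop work
```

One caveat I can see: variant descriptions in which a lower-case letter is directly followed by an upper-case one (for example a "del" or "ins" followed by bases) would also be split, so the boundary rule may need tightening for the full data.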
Alternatively, I have a file with a master list of all human genes (hugolist.dta). I was hoping to extract the mutations by matching each row in my file against the column of genes in that gene-list file, but I don't know whether that is easier in R or in Stata. I am rusty with text functions in R, though.